Support Flashinfer rope+quant+cache update fusion kernel for TRTLLM attention #36858
elvischenv wants to merge 4 commits into vllm-project:main
Conversation
Code Review
This PR introduces support for Flashinfer's fused RoPE, quantization, and KV cache update kernel, which is a great performance optimization for FP8 models on CUDA. The changes are well-structured, adding a new RopeQuantReshapeKVCachePattern to handle the fusion and updating related components to support it.
However, I've found a critical issue in vllm/v1/attention/backends/flashinfer.py where a check for KV cache sharing was removed, which could lead to incorrect behavior for models that use this feature. Please see my comment for details.
Force-pushed: 76992c4 to ed31eaa, ed31eaa to cb4d5e7
ProExpertProg left a comment:
I see now, the kernel requires attention metadata which is not built during PIECEWISE warmup/capture. We can keep it excluded for now but we should collect some perf numbers for this kernel inside/outside cudagraphs to see how much this hurts us. And we should only exclude it for FlashInfer
    @@ -205,13 +322,29 @@ def __init__(self, config: VllmConfig) -> None:
        self.max_token_num = cc.pass_config.rope_kvcache_fusion_max_token_num

        attn_layers = get_layers_from_vllm_config(config, Attention)
        for _, layer in attn_layers.items():
            if layer.impl.fused_rope_kvcache_supported():
                if current_platform.is_cuda():
Can we consolidate this:

    for layer in ...:
        if not layer.supported():
            continue
        for is_neox in [True, False]:
            if is_cuda():
                for use_flashinfer_rope in [True, False]:
                    RopeQuantReshapeKVCachePattern(...).register()
            if is_rocm():
                RopeReshapeKVCachePattern(...).register()
    @@ -1005,6 +1005,13 @@ def set_splitting_ops_for_v1(
        # list via reference.
        self.splitting_ops = list(self._attention_ops)

        # Like attn op, fuse_rope_kvcache op also needs to be a splitting op
attn metadata access does not matter here. What matters is whether the tensors and shapes are static; can we make them static so this doesn't need to be excluded from CG?
    @@ -83,7 +83,6 @@ def __init__(
        self.rotary_emb = get_rope(
            self.head_dim,
            max_position=config.max_position_embeddings,
Why is this needed?
dtype=torch.float32 mainly controls the dtype of cos_sin_cache. But in the runtime forward it will always be converted to the same dtype as the query by _match_cos_sin_cache_dtype, so dtype=torch.float32 has no effect other than delaying the conversion to runtime.
See vllm/vllm/model_executor/layers/rotary_embedding/base.py, lines 182 to 190 at d28d86e.
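To make the behavior described above concrete, here is a toy sketch with plain-Python stand-ins for tensors (the real code uses torch; all names here are illustrative only, not vLLM's actual implementation):

```python
class FakeTensor:
    """Minimal stand-in for a torch.Tensor carrying only a dtype tag."""

    def __init__(self, dtype):
        self.dtype = dtype

    def to(self, dtype):
        return FakeTensor(dtype)


class RopeSketch:
    def __init__(self):
        # The cos/sin cache is constructed in float32 at init time...
        self.cos_sin_cache = FakeTensor("float32")

    def _match_cos_sin_cache_dtype(self, query):
        # ...but cast to the query's dtype on first use, so building it
        # in float32 only delays the conversion to runtime.
        if self.cos_sin_cache.dtype != query.dtype:
            self.cos_sin_cache = self.cos_sin_cache.to(query.dtype)
```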
    @@ -1148,6 +1187,23 @@ def build(
            disable_split_kv=self.disable_split_kv,
        )
        attn_metadata.decode = FIDecode(wrapper=decode_wrapper)

        # Step 4: Pre-compute params for RoPE + FP8 quantize + KV cache update fusion
These look cudagraph-safe to me?
Yes, I have tested with different cudagraph modes and it currently works with cudagraph_mode=NONE/FULL_DECODE_ONLY/FULL_AND_PIECEWISE. Supporting FULL_AND_PIECEWISE requires the op to be excluded from the piecewise graph, since it needs to access attn_metadata.
        query_quant_scale: torch.Tensor | None = None,
        query_quant_out: torch.Tensor | None = None,
    ):
        if attn_metadata is None:
This means this would not work in piecewise cudagraphs?
        if attn_metadata is None:
            # Profiling run.
            return
This will prevent AITER rope-cache from being included in piecewise cudagraphs, which we definitely don't want.
    @@ -754,9 +754,9 @@ def fused_output_quant_supported(self, quant_key: "QuantKey"):
            """
            return False

    -    def fused_rope_kvcache_supported(self):
    +    def fused_rope_kvcache_supported(self, quant_key: "QuantKey | None" = None):
Nit: can you specify the quant is for query? Maybe call it query_quant_key?
Force-pushed: dd6afc1 to 89ffb62
Hi @elvischenv, the pre-commit checks have failed. Please run:

    uv pip install "pre-commit>=4.5.1"
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.
Force-pushed: 89ffb62 to b069728
elvischenv left a comment:
Hi @ProExpertProg, I have resolved most of the above comments. Could you help review again? Thanks!
I see now, the kernel requires attention metadata which is not built during PIECEWISE warmup/capture. We can keep it excluded for now but we should collect some perf numbers for this kernel inside/outside cudagraphs to see how much this hurts us. And we should only exclude it for FlashInfer
Regarding benchmarking the kernel inside/outside cudagraphs, I am not sure what this means. This kernel needs to access attn_metadata so it cannot be added to the piecewise cudagraph. It is already included in the full decode cudagraph. Can you elaborate on this?
This pull request has merge conflicts that must be resolved before it can be merged.
    # Compute slot_mapping consistent with block_table:
    slots = []
    for i in range(batch_spec.batch_size):
        context_len = batch_spec.seq_lens[i] - batch_spec.query_lens[i]
        for j in range(batch_spec.query_lens[i]):
            global_pos = context_len + j
            physical_block = block_table_tensor[i, global_pos // block_size].item()
            slots.append(physical_block * block_size + global_pos % block_size)
    slot_mapping = torch.tensor(slots, dtype=torch.int64, device=device)
is this change a general fix or is it something required specifically for this PR?
This is a general fix for the baseline (unfused path).
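For reference, the slot-mapping arithmetic in the fix above reduces to the following index math (a pure-Python sketch with the batch-spec fields flattened into plain lists; no torch needed for the indexing itself):

```python
def compute_slot_mapping(seq_lens, query_lens, block_table, block_size):
    """Map each new token's logical position to a physical KV-cache slot."""
    slots = []
    for i in range(len(seq_lens)):
        # Tokens already present in the cache for sequence i.
        context_len = seq_lens[i] - query_lens[i]
        for j in range(query_lens[i]):
            global_pos = context_len + j
            # Look up the physical block for this logical block index,
            # then add the offset within the block.
            physical_block = block_table[i][global_pos // block_size]
            slots.append(physical_block * block_size + global_pos % block_size)
    return slots
```

For example, with block_size=2 and block_table=[[5, 7]], a sequence with seq_len=3 and query_len=1 writes its single new token (global position 2) into physical block 7 at offset 0, i.e. slot 14.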
@elvischenv can you fix the merge conflicts please? I also think some of the fusion failures are related.
Force-pushed: d13d4ea to 696f8f6
This pull request has merge conflicts that must be resolved before it can be merged.
Commits (all Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>):
- update unit test
- resolve issue
- Apply suggestions from code review (Co-authored-by: Michael Goin <mgoin64@gmail.com>)
Force-pushed: 696f8f6 to 42731a1
@mgoin Fixed the conflicts. See vllm/vllm/compilation/passes/fusion/rope_kvcache_fusion.py, lines 270 to 274 at 0e39202. There is some hardcoding in conftest.py that may need fixes: vllm/tests/compile/fusions_e2e/conftest.py, lines 145 to 218 at 1f5ec28. @ProExpertProg could you look into this after this PR merges to main? Thanks.
Does this work without inductor partition? My understanding is that it won't work, because piecewise cudagraphs will simply skip the fused op since attention metadata is not set during piecewise capture.
@mgoin and I discussed this and we think a short-term fix could be to either set attention metadata during piecewise capture and make sure attention doesn't run, or just call the unfused kernel inside the fused op if metadata isn't set.
The proper long-term fix (proposed by @LucasWilkinson) would be to use static buffers and either access them through new metadata for kvcache update which includes the slot mapping, or just read them from the layer.
Could you try the long-term fix first?
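For concreteness, the short-term fallback idea above (call the unfused kernels inside the fused op when metadata isn't set) could look roughly like this; all function names below are hypothetical stand-ins, not vLLM's actual API:

```python
def unfused_rope(query, key):
    # Stand-in for the separate RoPE kernel.
    return query, key


def unfused_kv_cache_update(key, value):
    # Stand-in for the separate reshape-and-cache kernel.
    pass


def fused_kernel(query, key, value, attn_metadata):
    # Stand-in for the fused FlashInfer rope+quant+cache-update kernel.
    return query, key


def fused_rope_and_kv_cache_update(query, key, value, attn_metadata=None):
    if attn_metadata is None:
        # Piecewise cudagraph capture / profiling: metadata is not set, so
        # fall back to the unfused path instead of silently skipping the op.
        query, key = unfused_rope(query, key)
        unfused_kv_cache_update(key, value)
        return query, key
    return fused_kernel(query, key, value, attn_metadata)
```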
        dtype: torch.dtype,
        device: torch.device,
        prefix: str = "model.layers.0.self_attn.attn",
        attn_backend: AttentionBackendEnum = None,
Why is this ever None?
        view_to_reshape(gm)
        return gm

    pm.register_replacement(
In the case where we're using the layername wildcard, we should add an extra_check method that checks the fusion support for that layer.
Can we actually separate the closure and input ones into separate pattern/replacement classes? They can share a base
    fuse_rope_kvcache: bool = None  # type: ignore[assignment]
    """Fuse the QK rope + KV cache ops."""

    rope_kvcache_fusion_max_token_num: int = 256
Should we use the same threshold for this kernel? This was defaulted because the AITER kernel is slower than unfused above 256 tokens
        and self.use_inductor_graph_partition
        and self.pass_config.fuse_rope_kvcache
    ):
        self.splitting_ops.append(
This will work with inductor graph partition. Without it, fused_rope_and_unified_kv_cache_update will remain in the piecewise graph (necessary to perform fusion). But it won't be captured in piecewise cudagraphs because it will be skipped as attention metadata is not set
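A minimal sketch of the gating described above, under the assumption that the splitting-op list is built from a few config flags (helper name and signature are hypothetical; the real logic lives in vLLM's config code):

```python
def get_splitting_ops(
    use_inductor_graph_partition: bool,
    fuse_rope_kvcache: bool,
    attention_ops: list[str],
) -> list[str]:
    # Attention ops always split the graph. The fused rope+kvcache op is
    # appended only under inductor graph partition, which keeps it out of
    # piecewise cudagraphs (it needs attn_metadata at runtime).
    ops = list(attention_ops)
    if use_inductor_graph_partition and fuse_rope_kvcache:
        ops.append("fused_rope_and_unified_kv_cache_update")
    return ops
```

Without inductor graph partition, the op stays inside the piecewise graph (which is necessary for fusion to fire) but is skipped during piecewise capture, matching the behavior described in the comment above.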
        fuse_attn_quant=True,
        enable_qk_norm_rope_fusion=True,
        fuse_allreduce_rms=True,
        fuse_rope_kvcache=False,  # FIXME: disable to avoid compile range split
Instead of disabling the rope-cache fusion in tests, can we adjust the compile range logic?
To expand, I think we can do something like: ... though I'm not sure how to move the clamp off the hotpath. Edit: actually we might run into issues for ...
Actually I think the easiest would be to just move https://github.com/flashinfer-ai/flashinfer/blob/bf9b1dac855005ffaa57b48ae54cba30642bf213/include/flashinfer/pos_enc.cuh#L800-L1036 into vLLM and modify it to use a slot mapping (and support ...)
Purpose
Support the Flashinfer RoPE+Quant+KV Cache Update fusion kernel rope_quantize_fp8_append_paged_kv_cache.

Depends on flashinfer-ai/flashinfer#2792, which fixed the padding token issue for the kernel when using full cudagraph.
Test Plan and Test Result

Fusion pass unit test:

    pytest -v -s tests/compile/passes/test_rope_kvcache_fusion.py::test_rope_quant_kvcache_fusion

Model e2e accuracy
Server cmd:
Fused:
Unfused:
Model e2e perf
Fused: about 5% perf gain for GPT-OSS-120b TP8 con8
Unfused: